MONet Community Science Meeting 2023

This RMarkdown document contains code and scripts for preliminary processing and data visualization of soil biogeochemistry data. These graphs were created for use in Session 3 (Biogeochemistry Data Tutorial) on Tuesday, November 7.

USING THIS TUTORIAL
Download the bgc_data file and bgc_analyses file. Create a new R Script file and save it in the same folder as the data files. Make sure all these files are saved in the same folder, to minimize errors with file paths.

This file is a Markdown report, which includes text as well as code chunks. You can copy the R code directly from these chunks and paste it into your R Script file.


Setup

Load packages

You will need the {tidyverse} set of packages to run this tutorial. The tidyverse includes packages like {dplyr} and {tidyr} for data wrangling, and {ggplot2} for data visualization.

library(tidyverse) # for general data wrangling and visualization
theme_set(theme_bw(base_size = 12)) # set default ggplot theme


library(ggcorrplot) # for correlation matrix plot
library(ggtern) # for soil texture triangle (ternary plot) 

You may also want to download additional packages to jazz up your figures, such as the {soilpalettes} package, which includes soil-themed color palettes.

# Install `soilpalettes` using: 
# install.packages("devtools"); devtools::install_github("kaizadp/soilpalettes")

library(soilpalettes) # for color palettes

Import files

We need two files for this tutorial. bgc_data includes biogeochemical variables, and also has site metadata such as latitude-longitude and biome type. bgc_analyses includes additional information about the analyses performed.

You will need to change the file path before using this tutorial. The current file path reflects the directory structure in the GitHub repository.
If you downloaded the tutorial and the data file into the same folder, change the code chunk below to bgc_data = read_csv("bgc_data.csv") and `bgc_analyses = read_csv("bgc_data.csv") %>% ...

bgc_data = read_csv("Biogeochem/data/bgc_data.csv")
bgc_analyses = read_csv("Biogeochem/data/bgc_analyses.csv") %>% dplyr::select(analysis, analysis_type)

Process the data

We will use the data in two forms - wideform (bgc_data is already in wideform, with each analyte as a separate column) and longform (see below).

data_long = 
  bgc_data %>% 
  pivot_longer(cols = -c(Site_Code, Location, Lat, Long, biome_name), 
               names_to = "analysis") %>% 
  left_join(bgc_analyses)

One additional processing step before we start the tutorial – The Location column refers to the depth of the soil sample, “TOP” or “BTM”.
By default, R sorts its data alphabetically. Change the order here, before proceeding.

bgc_data = 
  bgc_data %>% 
  mutate(Location = factor(Location, levels = c("TOP", "BTM")))

BASIC EXPLORATION

Basic jitter plots to determine the range and distribution of each variable. Grouped by analysis type.

data_long %>% 
  ggplot(aes(x = analysis, y = value, color = Location))+
  geom_jitter(width = 0.3)+
  facet_wrap(~analysis_type, scales = "free")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The samples were collected from sites spread across five biomes

## Create summary of biome counts
biome_numbers = 
  data_long %>% 
  distinct(Site_Code, biome_name) %>% 
  group_by(biome_name) %>% 
  dplyr::summarise(n = n()) %>% 
  force()

biome_numbers %>% knitr::kable()
biome_name n
Deserts & Xeric Shrublands 13
Mediterranean Forests, Woodlands & Scrub 1
Temperate Broadleaf & Mixed Forests 21
Temperate Conifer Forests 19
Temperate Grasslands, Savannas & Shrublands 11
NA 5

CORRELATIONS

Use this function to set up the correlation plots for the entire dataset.
We use the ggcorrplot package for this.

Creating a correlation matrix plot can be complicated. We have simplified this by making a custom function fit_correlations_function. This bundles all the processing/calculation steps into a neat function, which you can then run below.

  fit_correlations_function = function(dat, TITLE = ""){
    num = 
      dat %>%       
      dplyr::select(where(is.numeric)) %>%
      drop_na()
    
    num_clean = 
      num %>% 
      rownames_to_column("row") %>% 
      pivot_longer(-row) %>% 
      separate(name, sep = "_", into = c("name")) %>% 
      pivot_wider() %>% 
      dplyr::select(-row)
    
    
    m = cor(num_clean)
    p.mat <- ggcorrplot::cor_pmat(num_clean)
    
    ggcorrplot::ggcorrplot(m, type = "lower",
                           p.mat = p.mat,
                           outline.color = "black",
                           #   lab = TRUE, 
                           insig = "blank",
                           colors = c("#E46726", "white", "#6D9EC1"),
                           title = TITLE)
    
  }

Now, apply the fit_correlations_function function to the dataset to generate the correlation plot.

# this will generate a correlation matrix for the entire BGC dataset
  corr_all = fit_correlations_function(bgc_data %>% dplyr::select(-Lat, -Long, -GWC))

# However, there are a lot of variables here. 
# So, for an easier plot, subset select variables and then re-run the correlation matrix.

data_subset = bgc_data %>% 
  dplyr::select(Ca, Mg, CEC, 
                Clay, SO4S, NH4N,
                TC, TN, Location)

fit_correlations_function(data_subset %>% filter(Location == "TOP"))

Individual Correlations

Here is simple code to visualize correlations among variables. You can also do this in the Shiny App.

bgc_data %>% 
  ggplot(aes(x = TC, y = TN))+
  geom_point()+
  geom_smooth(method = "lm", se = F)+
  labs(x = "Total C (%)",
       y = "Total N (%)")

bgc_data %>% 
  ggplot(aes(x = Clay, y = Sand))+
  geom_point()+
  geom_smooth(method = "lm", se = F)+
  labs(x = "Clay (%)",
       y = "Sand (%)")

bgc_data %>% 
  ggplot(aes(y = CEC, x = Clay))+
  geom_point()+
  geom_smooth(method = "lm", se = F)+
  labs(y = "Cation Exchange Capacity (meq/100g)",
       x = "Clay (%)")

bgc_data %>% 
  ggplot(aes(x = Ca, y = Bases))+
  geom_point()+
  geom_smooth(method = "lm", se = F)+
  labs(y = "Total bases (meq/100g)",
       x = "Calcium (meq/100g)")

bgc_data %>% 
  ggplot(aes(x = TC, y = WEOC))+
  geom_point()+
  geom_smooth(method = "lm", se = F)+
  labs(x = "Total carbon (%)",
       y = "Water-extractable organic carbon (mg/g)")


JITTER PLOTS

These are “jittered” scatter plots showing distribution of the variables by depth or by biome.
Also included is code to calculate the analysis of variance (ANOVA)

1. total carbon (TC)

TC ANOVA

lm((TC) ~ Location * biome_name, data = bgc_data) %>% car::Anova()
## Anova Table (Type II tests)
## 
## Response: (TC)
##                     Sum Sq  Df F value   Pr(>F)    
## Location            151.92   1 24.2377  2.9e-06 ***
## biome_name           92.18   4  3.6768 0.007448 ** 
## Location:biome_name  61.48   4  2.4521 0.049940 *  
## Residuals           714.55 114                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot TC by depth

bgc_data %>% 
  ggplot(aes(x = Location, y = TC, color = Location))+
  geom_boxplot(position = position_dodge(width = 0.5), width = 0.3, outlier.colour = NA)+
  geom_jitter(size = 3, width = 0.3)

plot TC by biome

bgc_data %>% 
  ggplot(aes(x = biome_name, y = TC, color = Location))+
  geom_boxplot(position = position_dodge(width = 0.5), width = 0.3)+
  geom_point(size = 3, position = position_dodge(width = 0.5))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  labs(x = "")

2. CEC

CEC ANOVA

lm((CEC) ~ Location * biome_name, data = bgc_data) %>% car::Anova()
## Anova Table (Type II tests)
## 
## Response: (CEC)
##                      Sum Sq  Df F value    Pr(>F)    
## Location              100.0   1  0.7992    0.3732    
## biome_name           3283.7   4  6.5619 8.661e-05 ***
## Location:biome_name   466.0   4  0.9311    0.4486    
## Residuals           14261.9 114                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot CEC by depth

bgc_data %>% 
  ggplot(aes(x = Location, y = CEC, color = Location))+
  geom_boxplot(position = position_dodge(width = 0.5), width = 0.3, outlier.colour = NA)+
  geom_jitter(size = 3, width = 0.3)+
  labs(x = "")

plot CEC by biome

bgc_data %>% 
  ggplot(aes(x = biome_name, y = CEC, color = Location))+
  geom_boxplot(position = position_dodge(width = 0.5), width = 0.3)+
  geom_point(size = 3, position = position_dodge(width = 0.5))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  labs(x = "")

3. Clay

Clay ANOVA

lm((Clay) ~ Location * biome_name, data = bgc_data) %>% car::Anova()
## Anova Table (Type II tests)
## 
## Response: (Clay)
##                      Sum Sq  Df F value    Pr(>F)    
## Location              483.7   1  5.2195   0.02412 *  
## biome_name           2574.6   4  6.9449 4.686e-05 ***
## Location:biome_name    88.4   4  0.2385   0.91607    
## Residuals           10936.1 118                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot Clay by depth

bgc_data %>% 
  ggplot(aes(x = Location, y = Clay, color = Location))+
  geom_boxplot(position = position_dodge(width = 0.5), width = 0.3, outlier.colour = NA)+
  geom_jitter(size = 3, width = 0.3)+
  labs(x = "")

plot Clay by biome

bgc_data %>% 
  ggplot(aes(x = biome_name, y = Clay, color = Location))+
  geom_boxplot(position = position_dodge(width = 0.5), width = 0.3)+
  geom_point(size = 3, position = position_dodge(width = 0.5))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  labs(x = "")

4. WEOC

WEOC ANOVA

lm((WEOC) ~ Location * biome_name, data = bgc_data) %>% car::Anova()
## Anova Table (Type II tests)
## 
## Response: (WEOC)
##                      Sum Sq  Df F value    Pr(>F)    
## Location            0.16630   1 12.7321 0.0005222 ***
## biome_name          0.14170   4  2.7122 0.0333182 *  
## Location:biome_name 0.05427   4  1.0387 0.3903703    
## Residuals           1.52822 117                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot WEOC by depth

bgc_data %>% 
  ggplot(aes(x = Location, y = WEOC, color = Location))+
  geom_boxplot(position = position_dodge(width = 0.5), width = 0.3, outlier.colour = NA)+
  geom_jitter(size = 3, width = 0.3)+
  labs(x = "")

plot WEOC by biome

bgc_data %>% 
  ggplot(aes(x = biome_name, y = WEOC, color = Location))+
  geom_boxplot(position = position_dodge(width = 0.5), width = 0.3)+
  geom_point(size = 3, position = position_dodge(width = 0.5))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  labs(x = "")


MAPS

The {ggplot2} package has functionality to easily plot maps (without downloading any other files or packages!).

We will plot our data over a map of the USA.
To do so, first load the states_map file, which contains the data for the state outlines.
Then, plot the state boundaries using the geom_polygon function.

states_map <- ggplot2::map_data("state")

ggplot(states_map, aes(long, lat, group = group))+
  geom_polygon(color = "grey", fill = NA)

Now, to plot our data over this map.
We add the geom_point layer over the base polygon layer of the states.

ggplot(states_map, aes(long, lat, group = group))+
  geom_polygon(color = "grey", fill = NA)+
  geom_point(data = bgc_data, 
             aes(x = Long, y = Lat, 
                 fill = TC, group = 1), 
             size = 6, shape = 21, color = "black")

The default {ggplot2} color palette may be difficult to interpret.
As an alternative, we use the {soilpalettes} package.

ggplot(states_map, aes(long, lat, group = group))+
  geom_polygon(color = "grey", fill = NA)+
  geom_point(data = bgc_data %>% filter(Location == "TOP"), 
             aes(x = Long, y = Lat, color = TC, group = 1), 
             size = 6)+
  scale_color_gradientn(colors = soil_palette("redox2", 5))+
  labs(color = "TC, %")+
  theme_void(base_size = 16)+
  theme(legend.position = "bottom")

ggplot(states_map, aes(long, lat, group = group))+
  geom_polygon(color = "grey", fill = NA)+
  geom_point(data = bgc_data %>% filter(Location == "TOP"), 
             aes(x = Long, y = Lat, color = WEOC, group = 1), 
             size = 6)+
  scale_color_gradientn(colors = soil_palette("redox2", 5))+
  labs(color = "WEOC, mg/g")+
  theme_void(base_size = 16)+
  theme(legend.position = "bottom")

ggplot(states_map, aes(long, lat, group = group))+
  geom_polygon(color = "grey", fill = NA)+
  geom_point(data = bgc_data %>% filter(Location == "TOP"), 
             aes(x = Long, y = Lat, color = CEC, group = 1), 
             size = 6)+
  scale_color_gradientn(colors = soil_palette("redox2", 5))+
  labs(color = "CEC, meq/100g")+
  theme_void(base_size = 16)+
  theme(legend.position = "bottom")

ggplot(states_map, aes(long, lat, group = group))+
  geom_polygon(color = "grey", fill = NA)+
  geom_point(data = bgc_data %>% filter(Location == "TOP"), 
             aes(x = Long, y = Lat, color = Clay, group = 1), 
             size = 6)+
  scale_color_gradientn(colors = soil_palette("redox2", 5))+
  labs(color = "Clay, %")+
  theme_void(base_size = 16)+
  theme(legend.position = "bottom")

Another map example, showing the different biomes

ggplot(states_map, aes(long, lat, group = group))+
  geom_polygon(color = "grey", fill = NA)+
  geom_point(data = bgc_data, 
             aes(x = Long, y = Lat, 
                 color = biome_name, group = 1), 
             size = 6)+
  scale_color_manual(values = soil_palette("redox2", 5))+
  theme_void(base_size = 16)+
  labs(fill = "")


TEXTURE

We plot % sand-silt-clay on a texture triangle (ternary plot) to determine the soil texture. We use the ggtern package for this.

You first plot the base ternary triangle. You will need to load the USDA dataset from ggtern.

data(USDA)

 texture_base = 
  ggtern() +
    geom_polygon(data = USDA, 
                 aes(x = Sand, y = Clay, z = Silt, group = Label),
                 fill = NA, size = 0.3, alpha = 0.5, color = "grey30")+
    theme_bw()+
    theme_showarrows()+
    theme_hidetitles()+
    theme_clockwise()

texture_base

then overlay your data points

texture_base +
  geom_point(data = bgc_data,
             aes(x = Sand, y = Clay, z = Silt),
             size = 4, color = "black")

Now, color the data points by CEC

texture_cec = 
  texture_base +
  geom_point(data = bgc_data %>% filter(!is.na(CEC)),
             aes(x = Sand, y = Clay, z = Silt, color = CEC),
             size = 4, )+
  scale_color_gradientn(colors = soil_palette("redox2", 5))+
  labs(fill = "CEC (meq/100g)")

Now, we need to add the texture labels.
Create a new dataframe from the USDA dataset (included with ggtern). Then add that to the previous plot.

  USDA_text <- 
    USDA  %>% 
    group_by(Label) %>%
    summarise_if(is.numeric, mean, na.rm = TRUE) %>%
    ungroup()
  
  #textures_names<-
  texture_cec +
    geom_text(data = USDA_text,
              aes(x = Sand, y = Clay, z = Silt, label = Label),
              size = 4, color = "black")